Language trees and zipping.
Authors
Abstract
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. The method is based on data-compression techniques, and its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the application of the method to linguistically motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification.
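The abstract does not spell out the remoteness measure, so the following Python sketch is only an illustration of the general compression-based idea, under stated assumptions. It assumes a simple measure: the number of extra compressed bytes a probe text costs when it is appended to a reference text. `zlib` stands in for whatever compressor is actually used, and the names `compressed_size`, `delta`, and `closest_reference` are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data, 9))


def delta(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed for `probe` when it follows `reference`.

    Assumption: a smaller value means the compressor could reuse more of
    what it saw in `reference`, i.e. the two texts are "closer".
    Note: zlib looks back at most 32 KiB, so `reference` should be short
    enough for the probe to still see it.
    """
    return compressed_size(reference + probe) - compressed_size(reference)


def closest_reference(probe: bytes, references: Mapping[str, bytes]) -> str:
    """Name of the reference text for which `probe` costs the fewest extra bytes."""
    return min(references, key=lambda name: delta(references[name], probe))


if __name__ == "__main__":
    # Toy reference corpora; real use would need far larger samples.
    refs = {
        "english": b"the quick brown fox jumps over the lazy dog and runs through the forest " * 20,
        "italian": b"la volpe veloce salta sopra il cane pigro e corre attraverso la foresta " * 20,
    }
    unknown = b"the lazy dog runs over the quick brown fox through the forest"
    print(closest_reference(unknown, refs))
```

In this toy form, language recognition amounts to choosing the reference corpus with the smallest `delta`. Turning such pairwise values into a distance suitable for building language trees would need further normalization, which this sketch does not attempt.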
Similar resources
Comment on "Language Trees and Zipping" arXiv:cond-mat/0108530
Every encoding carries a priori information if it represents any semantic information about the universe or object. Encoding means a mapping from the universe to a string or strings of digits. "Semantic" is used here in the model-theoretic sense, i.e., the denotation of the object. If the encoding or string of symbols is an adequate and true mapping of the model or object, and the mapping is recursive or compu...
Comment on "Language Trees and Zipping"
This is the extended version of a Comment submitted to Physical Review Letters. I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline. And finally, I review the literature, showing that the ideas of the Letter are not novel. I...
Extended Comment on Language Trees and Zipping
This is the extended version of a Comment submitted to Physical Review Letters. I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline. And finally, I review the literature, showing that the ideas of the Letter are not novel. I...
The Recent Letter "Language, Trees and Zipping" [1] Suggests Using Standard
compression programs to solve a number of problems. Unfortunately, the ideas are well known, and the technique, tested on a standard problem, is at least a factor of three worse than a simple baseline. In particular, the ideas of this Letter are very well known in several fields of Computer Science, including Machine Learning and Statistical Natural Language Processing. This Letter is essentia...
Language trees, zipping and error estimation
A method was recently proposed to estimate the distance between a pair of given texts. The distance estimation appeared to be reliable enough to infer a phylogenetic tree of languages, even though no error estimation has been provided. This essay reviews the method and explains its application for inferring phylogeny on a collection of heterogeneous texts. An approach for estimating the confidence o...
Journal title:
- Physical review letters
Volume 88, Issue 4
Pages -
Publication date 2002